Merge conflict resolution with main (#36)
Conversation
Large model downloads via huggingface_hub often hang or fail around 10GB. This adds a pre-download step with configurable retry/timeout before load_model() is called, so interrupted downloads can be resumed.
- New CLI flags for `serve`: `--download-timeout`, `--download-retries`, `--offline`
- New subcommand: `vllm-mlx download <model>` for pre-warming caches
Closes #75
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
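A minimal sketch of the retry loop around huggingface_hub's snapshot_download; the helper name and exact flag plumbing are illustrative, not the PR's actual code:

```python
import time
from huggingface_hub import snapshot_download

def predownload(model: str, retries: int = 3, timeout: float = 60.0, offline: bool = False) -> str:
    """Resolve the model into the local HF cache before load_model() runs.

    snapshot_download skips files that are already complete, so each retry
    effectively resumes an interrupted download rather than starting over.
    """
    last_err: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            return snapshot_download(
                repo_id=model,
                local_files_only=offline,   # --offline: never touch the network
                etag_timeout=timeout,       # --download-timeout (per-request timeout)
            )
        except Exception as err:            # real code would narrow the exception types
            last_err = err
            time.sleep(min(2 ** attempt, 30))   # simple backoff between attempts
    raise RuntimeError(f"download failed after {retries} attempts") from last_err
```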
The output_token_ids from AsyncEngineCore were tracked internally but never forwarded to GenerationOutput, leaving tokens always []. Also adds tests for the generate() output fields. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse MiniMax-M2.5's XML tool call format:
```
<minimax:tool_call>
<invoke name="function">
<parameter name="arg">value</parameter>
</invoke>
</minimax:tool_call>
```
Handles single/multiple tool calls, JSON parameter values, no-parameter calls, and preserves <think> blocks.
9 unit tests included.
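A minimal sketch of how this format can be parsed with regexes (names and structure are illustrative, not the parser's actual implementation):

```python
import json
import re

_INVOKE_RE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
_PARAM_RE = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)

def parse_minimax_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from <minimax:tool_call> blocks; <think> text outside them is untouched."""
    calls = []
    for block in re.findall(r"<minimax:tool_call>(.*?)</minimax:tool_call>", text, re.DOTALL):
        for name, body in _INVOKE_RE.findall(block):
            args = {}                            # stays empty for no-parameter calls
            for key, raw in _PARAM_RE.findall(body):
                raw = raw.strip()
                try:
                    args[key] = json.loads(raw)  # JSON parameter values (numbers, objects, lists)
                except json.JSONDecodeError:
                    args[key] = raw              # plain string value
            calls.append({"name": name, "arguments": args})
    return calls
```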
…n streaming parser
The streaming reasoning parser (BaseThinkingReasoningParser) scans the full
accumulated output text for `<think>`/`</think>` on every token via `in` checks on
previous_text and current_text. This is O(N) per token and O(N²) over a full
generation, becoming measurable at longer outputs (5ms+ at 2k tokens, 141ms
at 10k tokens).
Replace with a three-phase state machine (pre_think → thinking → content) that
tracks transitions using only the delta text. Each token is now O(1) regardless
of output length.
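A minimal sketch of the three-phase idea (simplified: it ignores tags split across token boundaries and the implicit-thinking mode, both of which the real parser handles):

```python
class ThinkStreamSketch:
    """Classify each streamed delta as reasoning or content using only the delta text."""

    def __init__(self, start_tag: str = "<think>", end_tag: str = "</think>"):
        self.start_tag, self.end_tag = start_tag, end_tag
        self.phase = "pre_think"          # pre_think -> thinking -> content

    def feed(self, delta: str) -> tuple[str, str]:
        """Return (reasoning_delta, content_delta) for one token's text."""
        if self.phase == "pre_think":
            if self.start_tag in delta:
                self.phase = "thinking"
                delta = delta.split(self.start_tag, 1)[1]
            else:
                self.phase = "content"    # no opening tag seen: treat as content
                return "", delta
        if self.phase == "thinking":
            if self.end_tag in delta:
                reasoning, content = delta.split(self.end_tag, 1)
                self.phase = "content"
                return reasoning, content
            return delta, ""
        return "", delta                   # content phase: pass everything through
```

Cost per token depends only on the delta length, not on how much text has already been generated.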
Benchmark results (streaming parser overhead, simulated server loop):

| Tokens | Old (scan) | New (state) | Speedup |
|-------:|-----------:|------------:|--------:|
| 500 | 0.37ms | 0.04ms | 8.6x |
| 1000 | 1.38ms | 0.10ms | 13.5x |
| 2000 | 5.28ms | 0.28ms | 19.1x |
| 5000 | 34.03ms | 2.05ms | 16.6x |
| 10000 | 141.26ms | 10.16ms | 13.9x |
At 50 tok/s decode on Apple Silicon, each token has a 20ms budget. The old parser
consumed 0.3ms/tok at 2k tokens and 1.4ms/tok at 10k — up to 7% of the budget
on overhead alone. The new parser is <0.01ms/tok at any length.
Changes:
- think_parser.py: Rewrote extract_reasoning_streaming() as a state machine with
_phase tracking. reset_state() initializes the phase. All three scenarios
preserved (explicit tags, implicit mode, no tags). Method signature unchanged
for backward compatibility.
- benchmarks/bench_reasoning_parser.py: Added streaming parser benchmark.
No changes to extract_reasoning() (non-streaming path) — it only runs once per
request and is not on the hot path.
Add _normalize_messages() to server.py and call it in all request paths before apply_chat_template. Maps non-standard roles (developer -> system, per OpenAI Responses API) and merges consecutive same-role messages.
Fixes agent crashes from:
- OpenAI Responses API sending role="developer" (unrecognized by Qwen3.5 template)
- OpenCode sending [system, system, user, user] (rejected by alternating-role templates)
Applied in create_chat_completion (both MLLM and LLM paths), create_anthropic_message, and _stream_anthropic_messages.
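A minimal sketch of the normalization (illustrative; the PR's actual helper lives in server.py):

```python
ROLE_MAP = {"developer": "system"}   # OpenAI Responses API role -> chat-template role

def normalize_messages(messages: list[dict]) -> list[dict]:
    """Map non-standard roles and merge consecutive same-role messages."""
    merged: list[dict] = []
    for msg in messages:
        role = ROLE_MAP.get(msg.get("role"), msg.get("role"))
        content = msg.get("content") or ""
        if merged and merged[-1]["role"] == role:
            # e.g. [system, system, user, user] collapses to one system and one user turn
            merged[-1]["content"] += "\n\n" + content
        else:
            merged.append({"role": role, "content": content})
    return merged
```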
Add detection and inference support for Google's Gemma 4 models (e.g. mlx-community/gemma-4-e2b-it-mxfp4) which include vision and audio capabilities via mlx-vlm >= 0.4.3. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Patch gemma4 Attention to snapshot cache.offset before mutation (mx.array.__iadd__ is in-place, causes wrong RoPE positions)
- Add Gemma 4 reasoning parser with channel name stripping (strips "thought"/"response" prefixes, supports both <channel|> and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
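A minimal sketch of the offset pitfall and the snapshot fix (illustrative, not the patch's exact code): because `cache.offset += n` on an mx.array mutates the array in place, any reference captured earlier moves with it, so the patched attention copies the value out first.

```python
import mlx.core as mx

# Buggy pattern:
#   offset = cache.offset            # aliases the cache's mx.array (shared storage)
#   cache.offset += n_new_tokens     # in-place __iadd__ also changes `offset`
#   positions = offset + mx.arange(n_new_tokens)   # wrong RoPE positions

def rope_positions(cache_offset: mx.array, n_new_tokens: int) -> mx.array:
    """Snapshot the offset as a plain int before the cache is advanced."""
    start = int(cache_offset)                # copy the scalar out of the shared array
    return start + mx.arange(n_new_tokens)   # immune to later in-place updates on the cache
```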
…okenizer
- Accept RotatingKVCache (used by Gemma 4) in batch cache validation
- Add missing return statement in load_model_with_fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This depends on PR 215 or PR 243 being applied first.
Error responses with token=0 were falling through to the detokenizer and decoding garbage text. Now they skip decoding and set the request status to FINISHED_ABORTED. Added a test for this case. Also ran black on batched.py to fix CI.
feat: add Gemma 4 multimodal model support
- Fix BatchKVCache offset bug: mx.array.__iadd__ mutates in-place, causing incorrect RoPE positions and token repetition
- Fix RotatingKVCache.max_size returning mx.array instead of int
- Add Gemma 4 reasoning parser (--reasoning-parser gemma4)
- Read additional EOS tokens from generation_config.json
- Fix RotatingKVCache prefix cache extraction (negative left_padding)
- Relax isinstance guard to accept RotatingKVCache for sliding window models like Gemma 4 (fixes ValueError on continuous batching)
- Remove unused _make_batch_cache() dead code
- Fix Anthropic endpoint JSON parsing for clients sending invalid escape sequences (e.g. \s, \d in regex patterns within tool defs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache
* test: add Gemma 4 tool parser tests (red)
* feat: add Gemma 4 tool call parser
* feat: register Gemma 4 parser, add streaming tests and wiring
* test: add edge case tests for DC review findings
  - Unclosed tool call block (server fallback path)
  - String containing colon (step-ordering guard)
  - String with real newline and double quote (JSON escaping)
* test: verify Gemma 4 tool calls produce exact OpenAI format for Claude Code
  Integration tests that verify the full pipeline (parser → server models → JSON serialization) matches what Claude Code expects: tool_calls structure, null content, function.arguments as JSON string, correct finish_reason.
* add Gemma 4 auto-detection to AutoToolParser
  Integrates Gemma 4 format as the first format tried in auto-detection, adds streaming markers for tool call start/end. Based on keegoid's approach in #254.
* remove unused pytest imports
* run black on tool parser, tests, and server
Co-authored-by: Jack Neil <jackneil@Jacks-Mac-Studio.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
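For reference, the OpenAI-format assistant message shape those integration tests check, as a generic example (not taken from the test fixtures):

```python
expected_message = {
    "role": "assistant",
    "content": None,                      # null content when only tool calls are returned
    "tool_calls": [
        {
            "id": "call_abc123",          # any unique id
            "type": "function",
            "function": {
                "name": "get_weather",
                # arguments must be a JSON *string*, not a dict
                "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
            },
        }
    ],
}
expected_finish_reason = "tool_calls"
```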
…258) Extends MLLM batch generator to support top_k, min_p, and presence_penalty alongside the existing repetition_penalty. This gives the MLLM path full parity with the LLM/SimpleEngine sampling parameter coverage.
Changes:
- MLLMBatchRequest: add top_k, min_p, presence_penalty fields
- MLLMBatch: add per-request samplers list (filter/extend support)
- _process_prompts: build per-request logits processors for presence_penalty and per-request samplers for top_k/min_p
- _step: accept and apply per-request samplers
- SamplingParams: add presence_penalty field
- MLLMScheduler: propagate new params from kwargs to batch requests
- BatchedEngine: pass new params through generate/stream_generate
When a request uses default values (top_k=0, min_p=0.0, presence_penalty=0.0), no extra processors or samplers are created — zero overhead for standard requests.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Fix Qwen3.5 hybrid paged cache reconstruction
* fix: add deduplication safety test and remove duplicate tokenizer hunk
  Add test confirming deduplicated terminal blocks correctly isolate recurrent state per sequence. Remove the duplicate tokenizer return fix that already ships in PR #215.
* style: format hybrid cache follow-up
* fix: keep simple engine serialized across cancellation (#8)
* fix: avoid nested simple engine generation locks
* fix: catch BaseException in cancellation handler, fix async test markers
  _run_blocking_serialized catches CancelledError (a BaseException subclass) from the outer scope, but the inner try/except used Exception, which would let a second CancelledError during `await task` escape unhandled. Changed to BaseException to suppress any exception from the draining await.
  Also fix test_simple_engine.py to use pytest.mark.anyio instead of pytest.mark.asyncio (pytest-asyncio is not configured), and add the anyio_backend fixture to conftest.py restricting to asyncio only, since trio is not installed.
* fix: preserve prompt token accounting after upstream refresh
* fix: restore specprefill fallback helper scope
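A minimal sketch of why the handler needs BaseException (on Python ≥ 3.8, asyncio.CancelledError subclasses BaseException, not Exception; the function name here is illustrative):

```python
import asyncio

async def drain_cancelled(task: asyncio.Task) -> None:
    """Await an already-cancelled generation task without letting a second cancel escape."""
    try:
        await task
    except BaseException:
        # `except Exception` is not enough: asyncio.CancelledError inherits from
        # BaseException, so a second cancellation delivered during this await would
        # otherwise propagate out of the cleanup path unhandled.
        pass
```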
…#221)
* Fix chunked prefill for mlx-lm prompt checkpoints
* fix: invoke prompt_checkpoint_callback in chunked-prefill path
  The upstream BatchGenerator contract requires prompt_checkpoint_callback to fire after cache finalization, before the checkpoint tail model call. The chunked-prefill monkeypatch preserved the checkpoint field but never invoked the callback, breaking the upstream checkpoint contract. Wire _lazy_extract_cache from mlx-lm and invoke the callback at the correct semantic boundary. Add regression test verifying the callback fires with the correct uid and checkpoint offset.
* test: cover checkpoint tail replay on upstream refresh
* style: format prompt checkpoint refresh
* fix: tolerate mlx-lm Batch export drift in chunked prefill
fix: populate tokens field in BatchedEngine.generate()
Upgrade mlx-vlm and torchvision so Qwen3.5 multimodal will run
* fix(server): integrate tool call parser into reasoning parser streaming path
* use _model_name instead of request.model in reasoning tool chunk
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
When tool_choice='none', models should never return tool calls. Two fixes:
1. Strip tools from chat template context — prevents templates from activating tool-call token generation.
2. Suppress tool call parsing — _parse_tool_calls_with_parser() returns early with no tools, streaming parser skips initialization.
Applied across all server paths: chat completions (streaming + non-streaming), Anthropic adapter (streaming + non-streaming).
Fixes #162
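A minimal sketch of the two suppression points (function and field names are illustrative, not the server's actual code):

```python
def build_template_context(request, tools):
    # 1. Keep tools out of the chat template so tool-call tokens are never primed.
    if request.tool_choice == "none":
        tools = None
    return {"tools": tools}

def maybe_parse_tool_calls(request, text, parser):
    # 2. Skip tool-call parsing entirely; the model output is returned as plain content.
    if request.tool_choice == "none" or not request.tools:
        return text, []
    return parser.extract_tool_calls(text)
```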
) Claude Code injects `x-anthropic-billing-header: cc_version=...; cch=HASH;` into the system prompt. The `cch=` hash changes with every request, causing token sequences to diverge at position ~40 and completely defeating prefix cache reuse across turn boundaries.
Strip this header before tokenization so consecutive requests from the same conversation share 99%+ of their token prefix.
Result: 50s → 3.65s per request (13.7x speedup) on Gemma 4 26B-A4B with 60K-token prompts.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
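A minimal sketch of the stripping step (illustrative regex; the real header contents vary per request):

```python
import re

# Matches e.g. "x-anthropic-billing-header: cc_version=1.2.3; cch=deadbeef;" on its own line.
_BILLING_HEADER_RE = re.compile(
    r"^x-anthropic-billing-header:.*(?:\n|$)", re.IGNORECASE | re.MULTILINE
)

def strip_billing_header(system_prompt: str) -> str:
    """Remove the per-request billing line so consecutive turns share the same token prefix."""
    return _BILLING_HEADER_RE.sub("", system_prompt)
```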
…fill
New files:
- patches/qwen3_5_mllm.py: BatchKVCache offset fix for Qwen3.5
- patches/qwen3_5_mtp.py: Runtime MTP injection for Qwen3.5
- tool_parsers/minimax_tool_parser.py: MiniMax-M2 tool parser
- scripts/add_mtp_weights_qwen35.py: Extract MTP weights from BF16
Key changes:
- mllm_batch_generator: chunked prefill, mid-batch extend, MTP hooks, patch registration, repetition penalty, prefill abort, think-suffix stripping for prefix cache
- mllm_scheduler: request status, cache config, prefill abort
- server: enable_thinking, tool_choice=none, tool argument coercion
- engines: MTP injection, enable_thinking, gpu_memory_utilization
- memory_cache: block LCP for hybrid models (SSM can't be rewound)
Prefix cache fix: enable_thinking=True adds <think>\n to the generation prompt, breaking PREFIX match across conversation turns. Strip these tokens from cache keys in both store and fetch paths so stored entries match as clean prefixes.
Tested: 3.12s → 0.39s (8x) for 1400-token prompts on Qwen3.5-122B hybrid model.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
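A minimal sketch of the prefix-cache key fix (illustrative; the helper name and the way the suffix tokens are obtained are assumptions):

```python
def cache_key_tokens(prompt_tokens: list[int], think_suffix: list[int]) -> list[int]:
    """Drop the trailing <think>\\n tokens that enable_thinking appends to the generation
    prompt, so stored entries and later lookups hash the same clean prefix."""
    n = len(think_suffix)
    if n and prompt_tokens[-n:] == think_suffix:
        return prompt_tokens[:-n]
    return prompt_tokens

# Applied symmetrically on the store path and the fetch path, e.g.:
# suffix = tokenizer.encode("<think>\n", add_special_tokens=False)
# key = hash(tuple(cache_key_tokens(tokens, suffix)))
```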
…-format Add <function=name> format support to Qwen tool parser
fix: import Path in tokenizer utils
…-leak fix: skip RNN snapshots in MTP optimistic mode to prevent memory leak
…e-machine perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens)
feat: add MiniMax tool call parsing support
…r-restack cli: expose harmony and gpt-oss tool parsers
* fix: unify tool-enabled simple chat on streaming path
* fix: preserve simple chat contracts on streaming path
* fix: keep tool chat on the streaming execution path
* fix: preserve streamed completion token counts
The try/except block computing `tokens` via tokenizer.encode() was unused -- the return statement already reads from final_output.tokens.
…stack simple-engine: keep tool chat on the streaming execution path
…, repetition_penalty) (#213)
Pass all OpenAI-compatible sampling parameters through to mlx-lm's make_sampler and make_logits_processors. Previously only temperature, top_p and max_tokens reached the engine — top_k, min_p, presence_penalty and repetition_penalty were silently dropped.
Changes:
- api/models.py: Add fields to ChatCompletionRequest and CompletionRequest
- request.py: Add presence_penalty to SamplingParams dataclass
- server.py: Extract and pass all params in every code path (6 locations), log all params on request
- models/llm.py: Build sampler with top_k/min_p, build logits_processors for presence_penalty/repetition_penalty
- engine/simple.py: Fix enable_thinking to read VLLM_MLX_ENABLE_THINKING env var instead of hardcoding based on model name
Tested with all 4 Unsloth Qwen 3.5 sampling profiles on 122B model.
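A minimal sketch of a presence-penalty logits processor of the kind built in models/llm.py (illustrative, not the PR's exact code; it assumes the usual `(generated_tokens, logits) -> logits` processor signature):

```python
import mlx.core as mx

def make_presence_penalty_processor(penalty: float):
    """Flat penalty on every vocabulary id that has already been generated.

    This follows the OpenAI definition of presence_penalty: unlike frequency_penalty,
    the penalty does not scale with how many times a token has repeated.
    """
    def processor(generated_tokens: mx.array, logits: mx.array) -> mx.array:
        if penalty == 0.0 or generated_tokens.size == 0:
            return logits
        counts = mx.zeros((logits.shape[-1],)).at[generated_tokens].add(1.0)
        seen = mx.minimum(counts, 1.0)      # 1.0 for any token that appeared at least once
        return logits - penalty * seen
    return processor
```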
* compatibility with mlx-lm 0.31.x BatchGenerator API
  The backport in f61d34e assumed internal BatchGenerator APIs that were refactored in mlx-lm 0.31.x. This breaks bench and serve for all users on v0.2.7.
  Changes:
  - Set prompt_progress_callback as instance attribute instead of passing it to BatchGenerator constructor (not a valid parameter)
  - Guard _install_chunked_prefill with hasattr check and log warning when skipped (relies on removed _process_prompts, active_batch)
  - Handle next() returning (prompt_responses, generation_responses) tuple instead of flat list
  - Add hasattr guard for active_batch in periodic cache eval
  Benchmark (Llama-3.2-1B-Instruct-4bit, mlx-lm 0.31.2):
  - Total time: 2.38s
  - Prompts: 10
  - Prompts/second: 4.19
  - Total prompt tokens: 80
  - Total completion tokens: 960
  - Total tokens: 1040
  - Tokens/second: 402.52
  - Throughput: 436.06 tok/s
  Closes #293
* bump to 0.2.8
* feat: add --prefill-step-size CLI flag
  Expose prefill_step_size as a CLI argument for both serve and bench commands. Default of 0 means "use engine default" (2048 for LLM, 1024 for MLLM), preserving existing behavior.
  Vision models routinely exceed 1024 tokens per prompt (images alone contribute 1400+), hitting the MLLM batch generator's safe limit. This flag lets users raise the limit without patching source code.
* Clarify MLLM prefill step override behavior
* refactor: clarify MLLM prefill CLI flag and validate override
…er stream_generate (#266)
stream_generate() is the only code path that consumes per-request SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and routes through _stream_generate_specprefill() when engaged. The prior direct self._model.generate() path silently dropped those overrides: server.py's create_completion() extracts them from extra_body and forwards to engine.generate(), engine.generate() forwards via **kwargs to _model.generate(), but _model.generate() (mlx_lm.generate) does not consume them. Non-streaming /v1/completions clients that sent `{"extra_body": {"specprefill": true}}` had their overrides silently no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates self.stream_generate() and returns the last GenerationOutput. Matches the pattern PR #222 established for tool-enabled chat().
Non-streaming clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the streaming path
- Same thread-safety (stream_generate holds self._generation_lock around the MLX call)
Scope: only generate() changes. chat() stays on its current path; extending chat() to the full accumulator pattern is a separate follow-up on top of PR #222.
Tests:
- New test_generate_accumulates_over_stream_generate stubs stream_generate with an async generator, calls generate() with per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens, completion_tokens, finish_reason, finished) match the last yielded chunk
  * specprefill / specprefill_keep_pct were forwarded through to stream_generate
- New test_generate_empty_stream_returns_safe_default covers the empty-stream edge case (returns GenerationOutput(text="", finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking so test_lock_prevents_concurrent_generate still observes serialization through the new accumulator path
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2 Ultra with a ~6K token prompt and extra_body.specprefill=true forcing SpecPrefill below the 8192 threshold:
  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR #265 (CompletionRequest schema + server-side extract_body -> gen_kwargs threading) which opens the wire from /v1/completions to engine.generate(). This PR closes the wire on the engine side.
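A minimal sketch of the accumulator pattern described above (field names follow the PR text; everything else is illustrative):

```python
async def generate(self, prompt: str, **kwargs) -> "GenerationOutput":
    """Thin non-streaming wrapper: consume stream_generate() and keep the last chunk.

    Every per-request kwarg (specprefill, specprefill_keep_pct, sampling params, ...)
    is forwarded untouched, so the two paths stay behaviorally identical.
    """
    final = None
    async for chunk in self.stream_generate(prompt, **kwargs):
        final = chunk                       # chunks are cumulative; the last one is the full output
    if final is None:                       # empty stream: safe default instead of raising
        return GenerationOutput(text="", finish_reason="stop")
    return final
```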
* feat(api): per-request SpecPrefill overrides on /v1/completions
  ChatCompletionRequest already accepts per-request `specprefill` and `specprefill_keep_pct` overrides, and /v1/chat/completions threads them into engine.chat(). CompletionRequest does not, so /v1/completions clients cannot opt a single request into (or out of) SpecPrefill, nor tune the keep percentage per request.
  Changes:
  - vllm_mlx/api/models.py: add specprefill and specprefill_keep_pct to CompletionRequest, matching the existing ChatCompletionRequest fields.
  - vllm_mlx/server.py::create_completion: extract both and thread into engine.generate(**gen_kwargs), mirroring the pattern used at server.py:1421 in create_chat_completion.
  - vllm_mlx/server.py::stream_completion: apply the same extraction so streaming /v1/completions clients get the same control.
  Both new fields default to None, so existing behavior is unchanged for clients that do not set them. No schema changes to ChatCompletionRequest. No engine-side changes needed: SimpleEngine.stream_generate already consumes these kwargs (see simple.py:307-308).
* style(server): align completions kwargs handling
Review Summary by Qodo
Prefix cache, MTP support, tool improvements, and streaming enhancements with comprehensive testing
Walkthrough
- **Prefix cache and KV cache optimization**: Added MemoryAwarePrefixCache support with chat template normalization, cache extraction/merging with RotatingKVCache buffer trimming, and hybrid model support (ArraysCache)
- **Multi-Token Prediction (MTP) support**: Implemented MTP injection for Qwen3.5 models via inject_mtp_support(), added MTP weight extraction script for Dense/MoE architectures, and integrated MTP into scheduler with always-advance verification
- **Chunked prefill with prompt checkpoints**: Enhanced chunked prefill to support prompt checkpoints (positive values for token positions, non-positive for offsets) with checkpoint tail replay and callback support
- **Client disconnect detection**: Added PrefillAbortedError exception and prefill abort tracking with heartbeat SSE comments in _disconnect_guard() to detect disconnects during long prefill operations
- **Tool call improvements**: Implemented tool argument coercion via _coerce_tool_arguments() to fix LLM tool failures, added Gemma 4 and Qwen function format parsers with streaming buffering support, and improved tool call filtering
- **Message normalization**: Added _normalize_messages() to map non-standard roles (e.g., "developer" → "system") and merge consecutive same-role messages for chat template compatibility
- **Reasoning parser enhancements**: Refactored streaming parser to state-machine approach with three phases (pre_think, thinking, content), added Gemma 4 reasoning parser support
- **Per-request sampling parameters**: Added forwarding of top_k, min_p, presence_penalty, repetition_penalty through all generation paths (completion, chat, streaming variants)
- **GPU memory utilization configuration**: Added --gpu-memory-utilization flag and dynamic memory pressure threshold calculation for Metal allocation limits
- **Model download utilities**: Implemented download_command() with retry logic, timeout, and offline mode support; optimized VLM loading with up-front detection to avoid double-loading penalty
- **Blocking operation refactoring**: Added _run_blocking_serialized() method for safe MLX operations under generation lock with proper cancellation handling
- **Comprehensive test coverage**: Added tests for chunked prefill checkpoints, Gemma 4 tool/reasoning parsers, streaming chat completion, message normalization, download utilities, and streaming aggregation

Diagram
```mermaid
flowchart LR
    A["Prefix Cache<br/>KV Optimization"] --> B["Batch Generator<br/>Chunked Prefill"]
    C["MTP Injection<br/>Qwen3.5"] --> B
    D["Tool Parsers<br/>Gemma4/Qwen"] --> E["Server<br/>Tool Coercion"]
    F["Message<br/>Normalization"] --> E
    G["Reasoning<br/>State Machine"] --> E
    B --> H["Scheduler<br/>Request Tracking"]
    H --> I["Engine<br/>Sampling Parameters"]
    J["GPU Memory<br/>Utilization"] --> I
    K["Download<br/>Utilities"] --> L["CLI<br/>Model Pre-download"]
    L --> I
    M["Blocking Ops<br/>Serialization"] --> I
```
File Changes
1. vllm_mlx/mllm_batch_generator.py

Code Review by Qodo
```python
# Emergency memory pressure threshold — dynamic based on gpu_memory_utilization
_gpu_mem_util = self.config.gpu_memory_utilization
try:
    _device_info = mx.device_info()
    _max_recommended = _device_info.get(
        "max_recommended_working_set_size",
        _device_info.get("memory_size", 0),
    )
    _memory_pressure_threshold = (
        int(_max_recommended * 0.85)
        if _max_recommended > 0
        else 200 * 1024 * 1024 * 1024
    _device_mem = mx.device_info().get("memory_size", 200 * 1024 * 1024 * 1024)
    _memory_pressure_threshold = int(
        _device_mem * min(_gpu_mem_util + 0.05, 0.99)
    )
```
1. Memory threshold uses physical RAM 🐞 Bug ☼ Reliability
EngineCore computes its emergency cache-clear threshold from mx.device_info()['memory_size'] instead of Metal’s max_recommended_working_set_size, so cache clearing can trigger far too late and lead to OOM/Metal instability. This diverges from other memory sizing in the codebase that consistently uses max_recommended_working_set_size for Metal limits.
Agent Prompt
### Issue description
`EngineCore`’s emergency memory pressure threshold is computed from `mx.device_info()['memory_size']` (physical memory), not `max_recommended_working_set_size`. On Metal, the recommended working set is the relevant ceiling; using physical memory can delay cache clearing until it’s too late, increasing risk of OOM / Metal command-buffer failures.
### Issue Context
Other parts of this repo already use `max_recommended_working_set_size` for memory sizing and limits (e.g., BatchedEngine’s `mx.set_memory_limit` and MLLM wired limit), so `EngineCore` should align with that source of truth.
### Fix Focus Areas
- vllm_mlx/engine_core.py[154-162]
- vllm_mlx/engine/batched.py[358-378]
### Implementation notes
- Prefer `max_recommended_working_set_size` when present; fall back to `memory_size` only if it’s missing/zero.
- Keep the `gpu_memory_utilization` scaling behavior, but scale the recommended working set, not physical memory.
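A minimal sketch of what these notes describe (illustrative; the device-info keys follow the quoted hunk, everything else is an assumption):

```python
import mlx.core as mx

def memory_pressure_threshold(gpu_memory_utilization: float) -> int:
    """Emergency cache-clear threshold in bytes, scaled from the Metal working set."""
    info = mx.device_info()
    # Prefer the recommended working set; fall back to physical memory only if it is missing/zero.
    budget = info.get("max_recommended_working_set_size", 0) or info.get("memory_size", 0)
    if budget <= 0:
        return 200 * 1024 * 1024 * 1024          # conservative fallback from the original code
    # Keep the gpu_memory_utilization scaling, applied to the recommended working set.
    return int(budget * min(gpu_memory_utilization + 0.05, 0.99))
```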
Summary
Resolves the 7 merge conflicts between
`fix/chat-template-kwargs-forwarding` and `main` on `waybarrios/vllm-mlx`:
- `enable_thinking` support (from main) alongside `chat_template_kwargs` forwarding (from this branch) in `batched.py`
- `_run_blocking_serialized` refactor in `simple.py`, while preserving `chat_template_kwargs` forwarding
- `chat_template_kwargs` through the new tool-stall workaround path in `simple.py`
See full details: waybarrios#218 (comment)